1 Executive Summary

  • This report explores the influence of daily routines on body mass index (BMI).
  • The main findings are:
    • Participants generally go to bed late, and many report problems with sleep quality.
    • Shorter sleep duration is weakly associated with higher BMI.
    • Weight is strongly and positively correlated with BMI.


2 Full Report

2.1 Initial Data Analysis (IDA)

  • Sleep debt and body anxiety have long been major topics of concern. BMI (body mass index) is a widely used international standard for assessing obesity and health.

  • The surveybmi data set comes from a group survey conducted on Wenjuanxing (https://www.wjx.cn/vj/hQSvcig.aspx), one of the most authoritative questionnaire platforms in China.

  • The questionnaire yields 13 variables: 8 quantitative and 5 qualitative. The survey was released on April 18 and ran for 2 days, with 155 participants.

  • Limitations:

    • Insufficient sample size and diversity: data for the 25-40 age group are lacking, and the sex ratio is uneven (102 female and 53 male participants). The results may therefore not be generalizable.
    • Recall bias: several questions ask for precise values (for example, hours of sleep per day) that participants must recall from memory.
# Read data
survey = read.csv("surveybmi.csv")

# quick view of the first 6 rows
head(survey)
##   X1.Your.age X2..Your.gender  X3..Your.job X4..Your.height X5..Your.weight
## 1          19          female undergraduate             170              48
## 2          18          female undergraduate             164              55
## 3          18          female undergraduate             167              60
## 4          19            male undergraduate             174              76
## 5          18          female undergraduate             160              55
## 6          18          female undergraduate             169              50
##   X7..What.time.do.you.sleep X8..How.long.do.you.sleep.every.day
## 1                 after 1 am                                 6.0
## 2              12 am to 1 am                                 8.0
## 3                 after 1 am                                 7.5
## 4             11 pm to 12 am                                 6.0
## 5              12 am to 1 am                                 8.0
## 6             11 pm to 12 am                                 8.0
##   X9..How.long.do.you.work.or.study.per.day
## 1                                        10
## 2                                         4
## 3                                        12
## 4                                        12
## 5                                        13
## 6                                        10
##   X10..How.long.do.you.exercise.per.day X13..How.is.your.sleeping.quality
## 1                                   2.0       often dream and not so good
## 2                                   0.0       occasionally dream and good
## 3                                   0.5       occasionally dream and good
## 4                                   2.0       occasionally dream and good
## 5                                   0.0       occasionally dream and good
## 6                                   2.0       often dream and not so good
##   X14..Your.sleep.quality   BMI How.long.you.sleep.per.day
## 1                       3 16.61                less than 7
## 2                       2 20.45                more than 7
## 3                       2 21.51                more than 7
## 4                       2 25.10                less than 7
## 5                       2 21.48                more than 7
## 6                       3 17.51                more than 7
# Create a vector of short column names
myname <- c(
  "age",
  "gender",
  "identity",
  "height",
  "weight",
  "resttime",
  "sleephour",
  "workhour",
  "sporthour",
  "sleepquality",
  "slpqual",
  "bmi",
  "slphour"
)

# assign the new names to survey
names(survey) <- myname

# quick view of the data
str(survey)
## 'data.frame':    155 obs. of  13 variables:
##  $ age         : int  19 18 18 19 18 18 23 23 19 18 ...
##  $ gender      : chr  "female" "female" "female" "male" ...
##  $ identity    : chr  "undergraduate" "undergraduate" "undergraduate" "undergraduate" ...
##  $ height      : num  170 164 167 174 160 169 169 180 170 173 ...
##  $ weight      : num  48 55 60 76 55 50 55 70 60 52 ...
##  $ resttime    : chr  "after 1 am" "12 am to 1 am" "after 1 am" "11 pm to 12 am" ...
##  $ sleephour   : num  6 8 7.5 6 8 8 8 6 12 8 ...
##  $ workhour    : num  10 4 12 12 13 10 8 8 4 5 ...
##  $ sporthour   : num  2 0 0.5 2 0 2 0 4 1 2 ...
##  $ sleepquality: chr  "often dream and not so good" "occasionally dream and good" "occasionally dream and good" "occasionally dream and good" ...
##  $ slpqual     : int  3 2 2 2 2 3 4 1 2 3 ...
##  $ bmi         : num  16.6 20.4 21.5 25.1 21.5 ...
##  $ slphour     : chr  "less than 7" "more than 7" "more than 7" "less than 7" ...
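
The bmi column appears to follow the standard definition: weight in kilograms divided by the square of height in metres. As a minimal check, assuming height is recorded in centimetres and weight in kilograms, the sketch below recomputes BMI for the first four rows printed by head() above. The vector names are illustrative and not part of the survey object.

```r
# First four rows taken from the head() output above
height_cm <- c(170, 164, 167, 174)
weight_kg <- c(48, 55, 60, 76)

# BMI = weight (kg) / height (m)^2; the units are an assumption
bmi_check <- round(weight_kg / (height_cm / 100)^2, 2)
bmi_check
```

The recomputed values (16.61, 20.45, 21.51, 25.10) agree with the BMI column for these rows.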
  • The median sleep duration is 7 hours a day (mean 7.29 hours); the longest is 12 hours and the shortest is 5 hours.
# show the minimum, maximum, mean and median of sleeping hour
summary(survey$sleephour)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   6.500   7.000   7.294   8.000  12.000
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p = plot_ly(data = survey, x = ~sleephour, type = 'box')
p
  • BMI varies widely, from a minimum of 13.56 to a maximum of 36.89, and its distribution is right-skewed. Most participants’ BMI values fall in the healthy range of 19-24, but 37.4% of participants are in a sub-healthy state, either overweight or underweight.
# Show the maximum, minimum, mean and median of bmi
summary(survey$bmi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.56   19.70   21.26   22.18   23.89   36.89
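
The share of participants outside the healthy range can be obtained by binning BMI. The sketch below is only an illustration: the BMI values are hypothetical rather than the survey data, and the cut points (19 and 24, with 19 counted as healthy and 24 as overweight) are an assumption based on the range quoted above.

```r
# Hypothetical BMI values for illustration only (not the survey data)
bmi_demo <- c(16.6, 20.4, 21.5, 25.1, 21.5, 17.5, 23.9, 30.2)

# Assumed bands: [ , 19) underweight, [19, 24) healthy, [24, ) overweight
band <- cut(
  bmi_demo,
  breaks = c(-Inf, 19, 24, Inf),
  labels = c("underweight", "healthy", "overweight"),
  right = FALSE
)

table(band)

# Proportion of values outside the healthy band
share_outside <- mean(band != "healthy")
share_outside
```

Applied to the real data, bmi_demo would be replaced by survey$bmi.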
# Call the ggplot library
library(ggplot2)

# Create a histogram to show the distribution of BMI
p = ggplot(data = survey, aes(x = bmi))
p + geom_histogram(aes(y = ..density..), binwidth = 0.5, fill = "lightblue", alpha = 0.3) + geom_density(alpha = .2, fill = "red") + xlab('BMI')


2.2 Research Question

2.2.1 Is sleep quality affected by the time you fall asleep?

  • The number of people reporting poor sleep is similar in each bedtime period, so the time of falling asleep does not appear to be the main determinant of sleep quality.
p = ggplot(data = survey, aes(x = resttime, fill = sleepquality))
p + geom_bar() + theme(panel.background = element_rect(fill = 'transparent', color = "gray"), axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5, color = "black", size = 9))

2.2.2 Does sleep duration affect BMI?

  • The average BMI of short sleepers (less than 7 hours) is about 6.8% higher than that of people who sleep more than 7 hours (23.02 versus 21.56).
# calculate the average bmi of people who sleep less than median amount and more than median amount
mean(survey$bmi[survey$sleephour < 7])
## [1] 23.01952
mean(survey$bmi[survey$sleephour > 7])
## [1] 21.56296
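
The relative difference between the two means printed above can be checked directly. The sketch below takes the longer-sleeping group as the baseline, which is an assumption about how the percentage is defined; the variable names are illustrative.

```r
# Group means copied from the output above
mean_short <- 23.01952  # sleephour < 7
mean_long  <- 21.56296  # sleephour > 7

# Relative difference in percent, with the longer sleepers as baseline
pct_diff <- (mean_short - mean_long) / mean_long * 100
round(pct_diff, 1)
```

With this baseline the difference comes to roughly 6.8%.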
  • The box plot below shows the BMI of three groups: people who sleep less than, exactly, and more than 7 hours a day. Comparing the group medians, people who sleep less than 7 hours have the highest BMI (median 22.265), while those who sleep more than 7 hours have the lowest (median 20.73). However, the result is not conclusive, because most of the extreme high-BMI observations are found in the longer-sleeping group.
library(plotly)
p = plot_ly(data = survey, x = ~bmi, color = ~slphour, type = 'box')
p
  • To limit the influence of extreme values, only observations with BMI between the lower threshold (LT = Q1 - 1.5 × IQR) and the upper threshold (UT = Q3 + 1.5 × IQR) are kept for the analysis.
# Calculate the IQR of BMI
iqrbmi = IQR(survey$bmi)
# IQR of BMI = 4.18, q1 of bmi = 19.7, q3 of bmi = 23.89
# UT of BMI = q3 + 1.5IQR = 30.16
UT = 23.89 + 1.5*iqrbmi
# LT of BMI = q1 - 1.5IQR = 13.43
LT = 19.7 - 1.5*iqrbmi
# Select the data from LT to UT
surveymain = subset(survey, survey$bmi<UT & survey$bmi>LT)
# Calculate the value of correlation between sleep duration and BMI
cor(surveymain$sleephour, surveymain$bmi)
## [1] -0.2742257
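
The thresholds above are typed in from the rounded summary() output. As a hedged alternative, they can be derived from the data with quantile() so that nothing is hard-coded. The sketch below uses a small hypothetical vector rather than the survey data, and the names lower, upper and kept are illustrative.

```r
# Hypothetical BMI values for illustration only (not the survey data)
bmi_demo <- c(18.2, 19.5, 20.1, 21.3, 22.0, 22.8, 23.9, 36.9)

# Quartiles and the 1.5 * IQR fences
q <- quantile(bmi_demo, c(0.25, 0.75))
fence <- 1.5 * IQR(bmi_demo)
lower <- q[[1]] - fence
upper <- q[[2]] + fence

# Keep only values strictly inside the fences
kept <- bmi_demo[bmi_demo > lower & bmi_demo < upper]
kept
```

In this toy example only the value 36.9 lies outside the fences and is dropped. Because quantile() works on unrounded quartiles, the thresholds may differ slightly from the hand-entered ones.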
  • The scatter plot below shows a weak negative correlation (coefficient -0.27) between sleep duration and BMI.
# Plot the scatter plot of sleephour and BMI
c=ggplot(surveymain,aes(x=sleephour,y=bmi))
c+geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'


2.2.3 What is the correlation between weight and BMI?

  • The points in the scatter plot trend upward, which suggests a linear relationship between weight and BMI.
# construct a scatter plot
c=ggplot(survey,aes(x=weight,y=bmi))
c+geom_point()

  • The correlation coefficient of weight and BMI is calculated to be 0.86, suggesting that there’s a strong positive correlation between the two.
#calculate the correlation coefficient
cor(survey$weight,survey$bmi)
## [1] 0.8614837
  • Fitting a linear model with weight as the response and BMI as the predictor gives an intercept of 2.8462 and a slope of 2.7478.
# calculate the linear regression model
L=lm(survey$weight~survey$bmi)
# coefficients and summary
L$coeff
## (Intercept)  survey$bmi 
##    2.846189    2.747777
summary(L)
## 
## Call:
## lm(formula = survey$weight ~ survey$bmi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.632  -4.338  -1.264   4.362  16.263 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8462     2.9490   0.965    0.336    
## survey$bmi    2.7478     0.1309  20.985   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.335 on 153 degrees of freedom
## Multiple R-squared:  0.7422, Adjusted R-squared:  0.7405 
## F-statistic: 440.4 on 1 and 153 DF,  p-value: < 2.2e-16
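
For a regression with a single predictor, the multiple R-squared equals the square of the correlation coefficient, so the two outputs above can be cross-checked. A minimal sketch using the printed correlation (the variable name r is illustrative):

```r
# Correlation between weight and BMI, copied from the cor() output above
r <- 0.8614837

# In simple linear regression, R-squared = r^2
r_squared <- round(r^2, 4)
r_squared
```

The result, 0.7422, matches the multiple R-squared in the model summary.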
  • The fitted straight line shows that weight and BMI rise together: the heavier the participant, the higher the BMI.
# draw the fitted line on the scatter plot
c=ggplot(survey,aes(x=weight,y=bmi)) 
c + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

  • The figure below plots the residuals (res) of the linear model against weight. Note that in the fitted model weight is the response and BMI is the predictor. The points are not randomly scattered: they fan out from left to right. Because the residuals show this pattern, the linear model does not fully describe the data. It is therefore concluded that weight is not the only factor that determines BMI; by definition, BMI also depends on height.
#construct the residual plot
res=L$residuals
ggplot(survey,aes(weight,res))+geom_point()+geom_hline(yintercept = 0, colour="yellow")
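
The role of height can be illustrated with simulated data. The sketch below is not the survey analysis: heights and weights are drawn at random under assumed ranges, and BMI is computed from its definition. A model using weight alone leaves variation unexplained, while on the log scale weight and height together reproduce BMI exactly, with coefficients 1 and -2.

```r
set.seed(1)

# Simulated (hypothetical) participants
n <- 100
height <- runif(n, 150, 190)  # cm, assumed range
weight <- runif(n, 45, 95)    # kg, assumed range
bmi <- weight / (height / 100)^2

# Weight alone does not fully explain BMI
fit_weight <- lm(bmi ~ weight)
summary(fit_weight)$r.squared

# log(bmi) = log(weight) - 2 * log(height) + constant
fit_both <- lm(log(bmi) ~ log(weight) + log(height))
round(coef(fit_both), 3)
```

This is consistent with the conclusion that weight is not the only determinant of BMI.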

3 Articles

Related studies report that BMI is closely related to weight (Hoor, Plasqui, Schols, & Kok, 2018; Peterson, Thomas, Blackburn, & Heymsfield, 2016). It has also been reported that sleeping more than 7 hours a day is associated with a decrease in BMI (Sung, 2017).

4 References

Peterson, C. M., Thomas, D. M., Blackburn, G. L., & Heymsfield, S. B. (2016). Universal equation for estimating ideal body weight and body weight at any BMI. The American Journal of Clinical Nutrition, 103(5), 1197–1203.

Sung, B. (2017). Analysis of the relationship between sleep duration and body mass index in a South Korean adult population: A propensity score matching approach. Journal of Lifestyle Medicine, 7(2), 76–83. doi: 10.15280/jlm.2017.7.2.76

Weight-height relationships and body mass index: Some observations from the diverse populations collaboration. (2005). American Journal Of Physical Anthropology, 128(1), 220-229. doi: 10.1002/ajpa.20107

Style: APA


5 Acknowledgements

  • April 4 13:00-15:00 Discuss and determine the research topic (4 persons together)
  • April 18 13:30-15:30 Design the questionnaire (4 persons together)
  • April 24 Clean the data, do some IDA work and discuss the analysis of research question (4 persons together)
  • April 28 Produce the report: IDA (by Zhang); reference collection and arrangement (by You); research questions 1 and 2 (by Xu and Yao); final linear model (4 persons together)
  • April 29 Prepare presentation (4 persons together)
  • April 30 Presentation recording (4 persons together)